Text Mining - Visualizations - Shakespeare's Plays

In this notebook, you will be introduced to some text visualizations. Just as in numerical data science, visual representation plays an important role in text analysis. Text lacks ready-made numerical values, which limits us somewhat, but there are still plenty of options.

You will be working with the complete set of Shakespearean plays, categorized into Comedies, Tragedies, and Histories.

In this notebook, we illustrate the importance of sectioning in the reporting stage by NOT providing a table of contents with navigational aids.

You'll also notice that we didn't suggest exercises, since visualizations will pop up throughout the other notebooks.


We start by initializing our trusty companions, the tm and qdap libraries. Other libraries will join them shortly.

In [1]:
library('tm')
library('qdap') # for cleaning, barcharts, and word networks
Loading required package: NLP
Loading required package: qdapDictionaries
Loading required package: qdapRegex
Loading required package: qdapTools
Loading required package: RColorBrewer

Attaching package: ‘qdap’

The following objects are masked from ‘package:tm’:

    as.DocumentTermMatrix, as.TermDocumentMatrix

The following object is masked from ‘package:NLP’:

    ngrams

The following object is masked from ‘package:base’:

    Filter


First, we create corpora for the three categories of Shakespearean plays (conveniently saved in the Data/ShakespeareComedies/, Data/ShakespeareTragedies/, and Data/ShakespeareHistories/ folders).

In [2]:
corpus_C <- Corpus(DirSource("Data/ShakespeareComedies/"), readerControl=list(language="en")) # load the documents
corpus_T <- Corpus(DirSource("Data/ShakespeareTragedies/"), readerControl=list(language="en")) # load the documents
corpus_H <- Corpus(DirSource("Data/ShakespeareHistories/"), readerControl=list(language="en")) # load the documents
summary(corpus_C)
summary(corpus_T)
summary(corpus_H)
Out[2]:
                                          Length Class             Mode
A_Midsummer_Nights_Dream_first_pass.txt   2      PlainTextDocument list
Alls_Well_That_Ends_Well_first_pass.txt   2      PlainTextDocument list
As_You_Like_It_first_pass.txt             2      PlainTextDocument list
Cymbeline_first_pass.txt                  2      PlainTextDocument list
Loves_Labours_Lost_first_pass.txt         2      PlainTextDocument list
Measure_for_Measure_first_pass.txt        2      PlainTextDocument list
Much_Ado_About_Nothing_first_pass.txt     2      PlainTextDocument list
Pericles_Prince_of_Tyre_first_pass.txt    2      PlainTextDocument list
Taming_of_the_Shrew_first_pass.txt        2      PlainTextDocument list
The_Comedy_of_Errors_first_pass.txt       2      PlainTextDocument list
The_Merchant_of_Venice_first_pass.txt     2      PlainTextDocument list
The_Merry_Wives_of_Windsor_first_pass.txt 2      PlainTextDocument list
The_Tempest_first_pass.txt                2      PlainTextDocument list
Troilus_and_Cressida_first_pass.txt       2      PlainTextDocument list
Twelfth_Night_first_pass.txt              2      PlainTextDocument list
Two_Gentlemen_of_Verona_first_pass.txt    2      PlainTextDocument list
Winters_Tale_first_pass.txt               2      PlainTextDocument list
Out[2]:
                                    Length Class             Mode
Antony_and_Cleopatra_first_pass.txt 2      PlainTextDocument list
Coriolanus_first_pass.txt           2      PlainTextDocument list
Hamlet_first_pass.txt               2      PlainTextDocument list
Julius_Caesar_first_pass.txt        2      PlainTextDocument list
King_Lear_second_pass.txt           2      PlainTextDocument list
Macbeth_first_pass.txt              2      PlainTextDocument list
Othello_first_pass.txt              2      PlainTextDocument list
Romeo_and_Juliet_first_pass.txt     2      PlainTextDocument list
Timon_of_Athens_first_pass.txt      2      PlainTextDocument list
Titus_Andronicus_first_pass.txt     2      PlainTextDocument list
Out[2]:
                               Length Class             Mode
Henry_IV_part_1_first_pass.txt 2      PlainTextDocument list
Henry_IV_part_2_first_pass.txt 2      PlainTextDocument list
Henry_V_first_pass.txt         2      PlainTextDocument list
Henry_VI_part_1_first_pass.txt 2      PlainTextDocument list
Henry_VI_part_2_first_pass.txt 2      PlainTextDocument list
Henry_VI_part_3_first_pass.txt 2      PlainTextDocument list
Henry_VIII_first_pass.txt      2      PlainTextDocument list
King_John_first_pass.txt       2      PlainTextDocument list
Richard_II_first_pass.txt      2      PlainTextDocument list
Richard_III_first_pass.txt     2      PlainTextDocument list

Then we build a cleaning function for the corpora.

In [3]:
# A cleaning function for the corpora
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stemDocument, language="english") # e.g. "lady" -> "ladi", "why" -> "whi"
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  # Drop standard English stopwords plus a few Shakespeare-specific ones
  # ("thi" is the stem of "thy"; "I", "and", and "the" are redundant here,
  # since stopwords("english") already covers them after lowercasing)
  corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "I", "and", "the", "that", "thou", "thee", "thi"))
  return(corpus)
}

clean_C = clean_corpus(corpus_C)
clean_T = clean_corpus(corpus_T)
clean_H = clean_corpus(corpus_H)
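
A quick sanity check never hurts: compare the opening of a raw play with its cleaned counterpart. A sketch, assuming the objects created above:

```r
# First 80 characters of the raw vs. cleaned first comedy
substr(corpus_C[[1]][1], 1, 80)
substr(clean_C[[1]][1], 1, 80)
```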
In [4]:
# Find the 20 most frequent terms: term_count
term_count_C <- freq_terms(clean_C,20)
term_count_T <- freq_terms(clean_T,20)
term_count_H <- freq_terms(clean_H,20)

# Plot term_count
plot(term_count_C)
plot(term_count_T)
plot(term_count_H)
Out[4]:
Out[4]:
Out[4]:

We can also take a look at some basic statistics on the length of each play, measured both in characters (letters) and in words.

In [5]:
# Statistics on the corpora's number of characters 
length_of_plays_char_C <- vector(mode="numeric", length=17)
for(j in 1:17){length_of_plays_char_C[j]=nchar(clean_C[[j]][1])}
hist(length_of_plays_char_C, freq=F, main="Distribution of # of characters in Shakespeare's Comedies")
summary(length_of_plays_char_C)

length_of_plays_char_T <- vector(mode="numeric", length=10)
for(j in 1:10){length_of_plays_char_T[j]=nchar(clean_T[[j]][1])}
hist(length_of_plays_char_T, freq=F, main="Distribution of # of characters in Shakespeare's Tragedies")
summary(length_of_plays_char_T)

length_of_plays_char_H <- vector(mode="numeric", length=10)
for(j in 1:10){length_of_plays_char_H[j]=nchar(clean_H[[j]][1])}
hist(length_of_plays_char_H, freq=F, main="Distribution of # of characters in Shakespeare's Histories")
summary(length_of_plays_char_H)

# Statistics on the corpora's number of words 
length_of_plays_word_C <- vector(mode="numeric", length=17)
for(j in 1:17){length_of_plays_word_C[j]=length(strsplit(gsub(' {2,}',' ',clean_C[[j]][1]),' ')[[1]])}
hist(length_of_plays_word_C, freq=F, main="Distribution of # of words in Shakespeare's Comedies")
summary(length_of_plays_word_C)

length_of_plays_word_T <- vector(mode="numeric", length=10)
for(j in 1:10){length_of_plays_word_T[j]=length(strsplit(gsub(' {2,}',' ',clean_T[[j]][1]),' ')[[1]])}
hist(length_of_plays_word_T, freq=F, main="Distribution of # of words in Shakespeare's Tragedies")
summary(length_of_plays_word_T)

length_of_plays_word_H <- vector(mode="numeric", length=10)
for(j in 1:10){length_of_plays_word_H[j]=length(strsplit(gsub(' {2,}',' ',clean_H[[j]][1]),' ')[[1]])}
hist(length_of_plays_word_H, freq=F, main="Distribution of # of words in Shakespeare's Histories")
summary(length_of_plays_word_H)
Out[5]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  47368   61653   70291   69293   74130   91107 
Out[5]:
Out[5]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  62700   67780   82886   79662   87840  100786 
Out[5]:
Out[5]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  71811   78630   83087   84154   89740   99953 
Out[5]:
Out[5]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   6930    9056   10198   10090   10849   13260 
Out[5]:
Out[5]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9301   10246   12360   11758   12905   14718 
Out[5]:
Out[5]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10488   11535   12142   12270   13089   14475 
Out[5]:
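
The six near-identical loops above can each be collapsed into a one-liner with sapply; a sketch using the same objects (equivalent results, less copy-paste):

```r
# Character counts for the comedies, without an explicit loop:
length_of_plays_char_C <- sapply(seq_along(clean_C), function(j) nchar(clean_C[[j]][1]))

# Word counts, same idea:
length_of_plays_word_C <- sapply(seq_along(clean_C),
    function(j) length(strsplit(gsub(' {2,}', ' ', clean_C[[j]][1]), ' ')[[1]]))
```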

Let's create term-document matrices (TDMs) from clean_C, clean_T, and clean_H, and deal with sparsity.

In [6]:
# Create TDMs 
C_tdm <- TermDocumentMatrix(clean_C)
T_tdm <- TermDocumentMatrix(clean_T)
H_tdm <- TermDocumentMatrix(clean_H)

# Print metadata
C_tdm
T_tdm
H_tdm

# Remove sparse terms
C_tdm <- removeSparseTerms(C_tdm, 0.75)
T_tdm <- removeSparseTerms(T_tdm, 0.75)
H_tdm <- removeSparseTerms(H_tdm, 0.75)

# Print metadata
C_tdm
T_tdm
H_tdm

# Convert to matrices
C_m <- as.matrix(C_tdm)
T_m <- as.matrix(T_tdm)
H_m <- as.matrix(H_tdm)

# Print the dimensions of the matrices
dim(C_m)
dim(T_m)
dim(H_m)
Out[6]:
<<TermDocumentMatrix (terms: 12093, documents: 17)>>
Non-/sparse entries: 43944/161637
Sparsity           : 79%
Maximal term length: 27
Weighting          : term frequency (tf)
Out[6]:
<<TermDocumentMatrix (terms: 10353, documents: 10)>>
Non-/sparse entries: 28987/74543
Sparsity           : 72%
Maximal term length: 62
Weighting          : term frequency (tf)
Out[6]:
<<TermDocumentMatrix (terms: 9877, documents: 10)>>
Non-/sparse entries: 29859/68911
Sparsity           : 70%
Maximal term length: 18
Weighting          : term frequency (tf)
Out[6]:
<<TermDocumentMatrix (terms: 2944, documents: 17)>>
Non-/sparse entries: 29551/20497
Sparsity           : 41%
Maximal term length: 12
Weighting          : term frequency (tf)
Out[6]:
<<TermDocumentMatrix (terms: 3491, documents: 10)>>
Non-/sparse entries: 20700/14210
Sparsity           : 41%
Maximal term length: 12
Weighting          : term frequency (tf)
Out[6]:
<<TermDocumentMatrix (terms: 3757, documents: 10)>>
Non-/sparse entries: 22335/15235
Sparsity           : 41%
Maximal term length: 14
Weighting          : term frequency (tf)
Out[6]:
  1. 2944
  2. 17
Out[6]:
  1. 3491
  2. 10
Out[6]:
  1. 3757
  2. 10
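
The "Sparsity" figure in the printouts is simply the share of zero entries in the matrix, and removeSparseTerms(tdm, sparse) keeps a term only if the fraction of documents it is missing from does not exceed sparse. A toy illustration (hypothetical strings, not from the plays):

```r
# With sparse = 0.5, a term survives only if it is absent from at most
# half of the documents:
toy <- VCorpus(VectorSource(c("love war", "love peace", "love", "war")))
toy_tdm <- TermDocumentMatrix(toy)
inspect(removeSparseTerms(toy_tdm, 0.5)) # "love" (3/4 docs) and "war" (2/4) survive; "peace" (1/4) does not

# The sparsity printed for, say, the trimmed C_tdm can be recomputed by hand:
mean(C_m == 0) # roughly 0.41, matching the 41% reported above
```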

Now for some barcharts of the most frequent terms.

In [7]:
# Calculate the rowSums: term_frequency
term_frequency_C <- rowSums(C_m)
term_frequency_T <- rowSums(T_m)
term_frequency_H <- rowSums(H_m)

# Sort term_frequency in descending order
term_frequency_C <- sort(term_frequency_C, decreasing=TRUE)
term_frequency_T <- sort(term_frequency_T, decreasing=TRUE)
term_frequency_H <- sort(term_frequency_H, decreasing=TRUE)

# View the top 20 most common words
term_frequency_C[1:20]
term_frequency_T[1:20]
term_frequency_H[1:20]

# Plot a barchart of the 20 most common words
barplot(term_frequency_C[1:20], col = "tan", las = 2)
barplot(term_frequency_T[1:20], col = "tan", las = 2)
barplot(term_frequency_H[1:20], col = "tan", las = 2)
Out[7]:
 will   sir  come shall  good  love   now  well   man   ill 
 2378  1555  1518  1454  1369  1343  1161  1159   973   941 
  let  know   one   say  lord  make  hath  like   may   whi 
  941   922   921   909   900   865   830   777   696   688 
Out[7]:
 will  come shall  lord  good   let   now  well  love  make 
 1300   984   964   861   819   777   759   700   688   608 
 know   sir   man   say   ill  hath  like  upon   tis   one 
  587   581   552   540   534   499   496   493   483   476 
Out[7]:
 lord  will shall  king   now  good  come  well   let  hath 
 1400  1394  1154  1103   880   829   798   650   649   570 
 make  like  upon   god   ill   say   may   man  love   yet 
  567   550   538   527   513   507   488   460   451   435 
Out[7]:
Out[7]:
Out[7]:

And finally some wordclouds.

In [8]:
# Load wordcloud package
library('wordcloud')

# Build term/frequency data frames in the format wordcloud() expects
word_freqs_C = data.frame(term = names(term_frequency_C), num = term_frequency_C, stringsAsFactors = FALSE)
word_freqs_T = data.frame(term = names(term_frequency_T), num = term_frequency_T, stringsAsFactors = FALSE)
word_freqs_H = data.frame(term = names(term_frequency_H), num = term_frequency_H, stringsAsFactors = FALSE)

# Create wordclouds 
wordcloud(word_freqs_C$term, word_freqs_C$num, max.words=100, colors="red")
wordcloud(word_freqs_T$term, word_freqs_T$num, max.words=100, colors="blue")
wordcloud(word_freqs_H$term, word_freqs_H$num, max.words=100, colors="black")
Out[8]:
Out[8]:
Out[8]:

To make a commonality cloud and a comparison cloud, we first collapse the (cleaned) comedies, tragedies, and histories into one string each, from clean_C, clean_T, and clean_H:

In [9]:
clean_C
clean_T
clean_H
Out[9]:
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 17
Out[9]:
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 10
Out[9]:
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 10
In [10]:
all_c = paste(sapply(seq_along(clean_C), function(j) clean_C[[j]][1]), collapse=" ")
all_t = paste(sapply(seq_along(clean_T), function(j) clean_T[[j]][1]), collapse=" ")
all_h = paste(sapply(seq_along(clean_H), function(j) clean_H[[j]][1]), collapse=" ")

Next, we gather those three strings into a single character vector:

In [11]:
all_ws = c(all_c,all_t,all_h)

And turn that vector into a corpus:

In [12]:
all_ws = VectorSource(all_ws)
ws_corpus = VCorpus(all_ws)
ws_corpus
Out[12]:
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 3
In [13]:
inspect(ws_corpus)
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 3

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 1177997

[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 796626

[[3]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 841551

Now we create a TDM for this corpus, which we cast as a matrix object:

In [14]:
ws_tdm = TermDocumentMatrix(ws_corpus)
colnames(ws_tdm) = c("Com.","Trag.","Hist.")
ws_m = as.matrix(ws_tdm)

The commonality cloud can be printed using ... commonality.cloud:

In [15]:
commonality.cloud(ws_m,colors = "darkblue", max.words = 100)
commonality.cloud(ws_m,colors = "darkgreen", max.words = 200)
commonality.cloud(ws_m,colors = "darkred", max.words = 500)
Out[15]:
Out[15]:
Out[15]:

And the comparison cloud using... comparison.cloud:

In [16]:
comparison.cloud(ws_m, colors=c("darkred","darkgreen","darkblue"), max.words = 100)
comparison.cloud(ws_m, colors=c("darkred","darkgreen","darkblue"), max.words = 200)
comparison.cloud(ws_m, colors=c("darkred","darkgreen","darkblue"), max.words = 500)
Out[16]:
Out[16]:
Out[16]:

We can also do pyramid plots by first finding the terms that are common to any two corpora:

In [17]:
common_words_CT = subset(ws_m, ws_m[,1] > 0 & ws_m[,2] > 0)
dim(common_words_CT)
common_words_CH = subset(ws_m, ws_m[,1] > 0 & ws_m[,3] > 0)
dim(common_words_CH)
common_words_TH = subset(ws_m, ws_m[,2] > 0 & ws_m[,3] > 0)
dim(common_words_TH)
Out[17]:
  1. 6437
  2. 3
Out[17]:
  1. 6324
  2. 3
Out[17]:
  1. 5701
  2. 3

The differences in the number of times each token is used in each corpus can be computed with:

In [18]:
difference_CT = abs(common_words_CT[,1] - common_words_CT[,2])
difference_CH = abs(common_words_CH[,1] - common_words_CH[,3])
difference_TH = abs(common_words_TH[,2] - common_words_TH[,3])

Let's bind these difference counts to the respective common_words matrices, and order the rows by decreasing difference:

In [19]:
common_words_CT = cbind(common_words_CT,difference_CT)
common_words_CT = common_words_CT[order(common_words_CT[,4],decreasing=TRUE),]
common_words_CH = cbind(common_words_CH,difference_CH)
common_words_CH = common_words_CH[order(common_words_CH[,4],decreasing=TRUE),]
common_words_TH = cbind(common_words_TH,difference_TH)
common_words_TH = common_words_TH[order(common_words_TH[,4],decreasing=TRUE),]

Let's say we want to plot the top $n=30$ words for each pair of corpora.

In [20]:
n=30
top_df_CT = data.frame(x=common_words_CT[1:n,1], y=common_words_CT[1:n,2], labels=rownames(common_words_CT[1:n,]))
top_df_CH = data.frame(x=common_words_CH[1:n,1], y=common_words_CH[1:n,3], labels=rownames(common_words_CH[1:n,]))
top_df_TH = data.frame(x=common_words_TH[1:n,2], y=common_words_TH[1:n,3], labels=rownames(common_words_TH[1:n,]))
top_df_CT
top_df_CH
top_df_TH
Out[20]:
             x     y  labels
will      2378  1300  will
sir       1555   581  sir
love      1343   688  love
good      1369   819  good
come      1518   984  come
shall     1454   964  shall
master     622   142  master
well      1159   700  well
one        921   476  one
man        973   552  man
ill        941   534  ill
now       1161   759  now
say        909   540  say
caesar      26   380  caesar
know       922   587  know
hath       830   499  hath
like       777   496  like
whi        688   413  whi
ladi       447   180  ladi
ani        424   164  ani
make       865   608  make
sweet      422   170  sweet
fair       407   158  fair
rome        13   262  rome
can        569   321  can
see        676   429  see
may        696   452  may
mistress   309    67  mistress
mine       519   294  mine
think      537   316  think
Out[20]:
             x     y  labels
sir       1555   434  sir
will      2378  1394  will
love      1343   451  love
king       317  1103  king
come      1518   798  come
good      1369   829  good
know       922   404  know
one        921   407  one
man        973   460  man
well      1159   650  well
lord       900  1400  lord
master     622   186  master
ill        941   513  ill
say        909   507  say
whi        688   359  whi
shall     1454  1154  shall
make       865   567  make
let        941   649  let
now       1161   880  now
ladi       447   171  ladi
franc       36   310  franc
mistress   309    37  mistress
tis        583   319  tis
fool       306    44  fool
see        676   414  see
england      9   269  england
hath       830   570  hath
veri       412   152  veri
think      537   279  think
must       652   395  must
Out[20]:
             x     y  labels
king       231  1103  king
lord       861  1400  lord
caesar     380    14  caesar
duke        42   327  duke
franc       27   310  franc
princ       54   305  princ
england     31   269  england
love       688   451  love
rome       262    25  rome
god        318   527  god
antoni     200     1  antoni
grace      112   309  grace
john         4   194  john
shall      964  1154  shall
come       984   798  come
edward       1   187  edward
know       587   404  know
crown       43   223  crown
tis        483   319  tis
majesti     43   196  majesti
brutus     152     2  brutus
unto        77   226  unto
sir        581   434  sir
night      293   154  night
natur      206    70  natur
day        236   368  day
doe        157    28  doe
let        777   649  let
arm        104   230  arm
blood      176   298  blood

Finally, we produce the pyramid plots themselves:

In [21]:
library(plotrix)
In [22]:
pyramid.plot(top_df_CT$x,top_df_CT$y,labels=top_df_CT$labels,
             gap=500,top.labels=c("Comedies", "Terms", "Tragedies"), main="Common Terms",
             laxlab=NULL, raxlab=NULL, unit=NULL)

pyramid.plot(top_df_CH$x,top_df_CH$y,labels=top_df_CH$labels,
             gap=500,top.labels=c("Comedies", "Terms", "Histories"), main="Common Terms",
             laxlab=NULL, raxlab=NULL, unit=NULL)

pyramid.plot(top_df_TH$x,top_df_TH$y,labels=top_df_TH$labels,
             gap=500,top.labels=c("Tragedies", "Terms", "Histories"), main="Common Terms",
             laxlab=NULL, raxlab=NULL, unit=NULL)
Out[22]:
  1. 5.1
  2. 4.1
  3. 4.1
  4. 2.1
Out[22]:
Out[22]:
  1. 4
  2. 2
  3. 4
  4. 2
Out[22]:
Out[22]:
  1. 4
  2. 2
  3. 4
  4. 2
Out[22]:

We can also try to look at word associations / phrase nets: nodes represent terms, and links represent connections between terms. First, we put all the contents of the various tragedies in a data frame:

In [23]:
Tragedies <- data.frame(Tragedies = sapply(seq_along(clean_T), function(j) clean_T[[j]]$content),
                        stringsAsFactors = FALSE)

Let's look for associations with the word "master".

THERE'S AN ISSUE WITH ONE OF THE PACKAGES ON THE SERVER; THIS STEP COULD TAKE A FEW HOURS TO RUN. We're leaving the code as an example, but don't worry about running this part.

In [25]:
#word_associate(Tragedies$Tragedies,match.string=c("master"), stopwords=c(Top200Words),network.plot=TRUE,cloud.colors=c("gray70","darkblue"))
#title(main="Terms Associated with 'Master' in Shakespeare's Tragedies")
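
Since word_associate is off the table here, tm's own findAssocs() offers a lighter-weight alternative: it reports terms whose per-document counts correlate with a target term. A sketch using the tragedy TDM built earlier (with only 10 documents the correlations are coarse, so treat the output as suggestive):

```r
# Terms whose counts across the ten tragedies correlate with "master"
# at 0.8 or above:
findAssocs(T_tdm, "master", corlimit = 0.8)
```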
In [0]: